December 2, 2014

Introduction

  • Testing the empirical performance of a variety of bandit algorithms, both context-less and contextual
  • Using the Yahoo! Webscope Today Module data [11], which covers 10 days of article recommendation data on the Yahoo! homepage
  • Building heavily on Li, Chu et al. 2010, which uses this dataset to create an unbiased estimator of bandit algorithm performance [7]
    • Our main contribution is investigating the characteristics of this specific dataset further, and running far more algorithms on it
    • We may also clean up the algorithm implementations and post them publicly (actual code is hard to find)

Motivation

  • Bandit algorithms are not as well empirically tested / benchmarked as most other areas of statistical learning; most current empirical tests are simulations or toy problems [2, 5]
    • Simulations are subject to bias, and naturally work well when the underlying distribution matches the one assumed by the algorithm
    • Toy problems don't capture the complexity of real-world problems
  • Good benchmarks are hard to find, since we can't observe counterfactuals in a logged dataset (the off-policy problem); this almost requires online experiments [7, 9]
  • The Yahoo! Today Module dataset claims to be the first solid benchmark

Yahoo! Today Module

Data Description / Characteristics

  • 45,811,883 distinct visit events over 10 days in 2009
  • 271 articles observed within the course of the 10 days
  • User is shown an article in the feature box
  • The pool of candidate articles changes over time (its size varies from 19 to 25)
  • Users and articles each have 5 features derived from conjoint analysis, plus one constant feature
  • The reward for our bandit algorithms, and the primary evaluation metric, is article clicks / clickthrough rate (CTR)

Problem Formulation

For algorithm/policy \(P\) and each trial \(t = 1, \dots, T\):

  1. The algorithm observes the current user \(u_t\) and a set \(A_t\) of arms (articles), together with a context vector \(x_{t,a}\) for each \(a \in A_t\). \(x_{t,a}\) contains features and interactions from both user \(u_t\) and arm \(a\); specifically, \(x_{t,a} = \mathbf{u} \mathbf{a}^T\), where \(\mathbf{u}\) is the feature vector of user \(u_t\) and \(\mathbf{a}\) is the feature vector of arm \(a\)
  2. Based on \(P\)'s observations of per-arm rewards from previous trials, \(P\) chooses an arm \(a_t\) from \(A_t\) and receives payoff \(r_{t,a_t}\).
  3. The algorithm updates its strategy based on the new observation. No feedback is received for unchosen arms.
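As a sketch, the trial loop above might look like the following in Python. The `Policy` interface, the logistic click model, and the random context vectors are illustrative assumptions (in the real data, \(x_{t,a}\) is the flattened outer product \(\mathbf{u}\mathbf{a}^T\)):

```python
import numpy as np

class Policy:
    """Interface for the trial loop: choose an arm, then update on its reward."""
    def choose(self, contexts):               # contexts: {arm_id: x_{t,a}}
        raise NotImplementedError
    def update(self, arm, context, reward):   # feedback only for the chosen arm
        pass

class RandomPolicy(Policy):
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
    def choose(self, contexts):
        return self.rng.choice(list(contexts))

def run_trials(policy, T, n_arms=20, d=6, seed=0):
    """Simulate T trials against a synthetic logistic click model."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d)                # hidden reward parameter
    clicks = 0.0
    for t in range(T):
        # Illustrative stand-in for x_{t,a}: random d-dimensional contexts
        contexts = {a: rng.normal(size=d) for a in range(n_arms)}
        arm = policy.choose(contexts)
        p_click = 1.0 / (1.0 + np.exp(-theta @ contexts[arm]))
        reward = float(rng.random() < p_click)
        policy.update(arm, contexts[arm], reward)
        clicks += reward
    return clicks / T                         # empirical CTR
```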

Unbiased Estimator

Policy Evaluator (from Li et al. 2010 [7])

  1. Inputs \(T > 0\); bandit algorithm \(P\); stream of events \(S\)
  2. \(h_0 \leftarrow \emptyset\text{ {empty history}}\)
  3. \(\hat{G}_P \leftarrow 0\text{ {zero payoff to start}}\)
  4. \(\mathbf{for}\text{ }t = 1,2,3,...,T \mathbf{do}\)
  5. \(\;\;\;\;\mathbf{repeat}\)
  6. \(\;\;\;\;\;\;\;\;\text{Get next event }(\mathbf{x}, a, r_a)\text{ from }S\)
  7. \(\;\;\;\;\mathbf{until}\text{ }P(h_{t-1}, \mathbf{x}) = a\)
  8. \(\;\;\;\;h_t \leftarrow\) CONCATENATE\((h_{t-1}, (\mathbf{x}, a, r_a))\)
  9. \(\;\;\;\;\hat{G}_P \leftarrow \hat{G}_P + r_a\)
  10. \(\mathbf{end\text{ }for}\)
  11. Output: \(\hat{G}_P/T\)
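A direct Python transcription of the Policy Evaluator pseudocode above; the event-stream and policy signatures are assumptions for illustration:

```python
def policy_evaluator(policy, stream, T):
    """Estimate per-trial payoff of `policy` from a logged event stream.

    `stream` yields (x, a, r_a) events logged by a uniformly-random policy;
    `policy(history, x)` returns the arm the evaluated algorithm would pull.
    """
    history = []                  # h_0: empty history
    G = 0.0                       # zero payoff to start
    for t in range(T):
        # Rejection step: discard logged events until the evaluated
        # policy's choice agrees with the arm that was actually shown.
        while True:
            x, a, r_a = next(stream)
            if policy(history, x) == a:
                break
        history.append((x, a, r_a))
        G += r_a
    return G / T
```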

Rejection Sampling

  • The policy evaluator follows the same general principle as rejection sampling
  • Keep logged events whose action matches the policy's choice; reject the rest

Limitations

  • The evaluator is biased when the stream \(S\) of data is finite (it can end prematurely)
  • Lots of data is rejected in order to reach a desired \(T\). On average:
    • 220,000 data points to get 10,000 usable observations
    • 6.7 million data points to get 300,000 usable observations
    • 25 million data points to get 1 million usable observations
  • In this case we cap \(T\) at 1 million, which means \(S\) is effectively infinite
  • Given \(K\) arms, we'd expect a fraction \(\frac{K-1}{K}\) of events to be discarded [7]
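The rejection rates quoted above are consistent with the expected \(\frac{K-1}{K}\) discard fraction; a quick arithmetic check on the averages:

```python
# With K arms and a uniformly-random logging policy, the evaluator keeps
# ~1/K of events, so T usable observations consume ~K*T raw events.
# Implied K from the raw-events / kept-observations ratios quoted above:
pairs = [(220_000, 10_000), (6_700_000, 300_000), (25_000_000, 1_000_000)]
for raw, kept in pairs:
    print(f"{raw:>12,} raw -> {kept:>9,} kept: implied K = {raw / kept:.1f}")
```

The implied pool sizes of roughly 22 to 25 sit inside the observed 19 to 25 range.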

Exploring the Dataset

Datapoints Close to Cluster Centers

  • Features sum to one, and in general one feature dominates each user
  • If clusters are closely linked to different articles, contextual bandits should outperform context-less ones

Maximum features for each user vector.

Clickthrough Rates for Clusters

Top arms per cluster and their CTRs

General CTR for Same Articles

Same top arms, general CTR ignoring user clusters / side information

Results

Algorithms Compared

Context-Less

  • Random
  • UCB (Lai & Robbins 1985; Cappé et al. 2013) [3, 6]
  • KL-UCB (Cappé et al 2013) [3]
  • Thompson Sampling (Thompson 1933) [10]
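Minimal sketches of two of the context-less baselines above: a UCB1-style index policy and Bernoulli Thompson Sampling with Beta(1,1) priors. The simulation setup in the usage is an illustrative assumption (KL-UCB differs from UCB only in how the index is computed):

```python
import math, random

class UCB1:
    """UCB1 index policy: empirical mean plus an exploration bonus."""
    def __init__(self, K):
        self.K, self.n, self.s, self.t = K, [0] * K, [0.0] * K, 0
    def choose(self):
        self.t += 1
        for a in range(self.K):               # play each arm once first
            if self.n[a] == 0:
                return a
        return max(range(self.K),
                   key=lambda a: self.s[a] / self.n[a]
                   + math.sqrt(2 * math.log(self.t) / self.n[a]))
    def update(self, a, r):
        self.n[a] += 1
        self.s[a] += r

class ThompsonBernoulli:
    """Thompson Sampling: sample from each arm's Beta posterior, pick the max."""
    def __init__(self, K):
        self.alpha, self.beta = [1] * K, [1] * K   # Beta(1,1) priors
    def choose(self):
        draws = [random.betavariate(self.alpha[a], self.beta[a])
                 for a in range(len(self.alpha))]
        return max(range(len(draws)), key=draws.__getitem__)
    def update(self, a, r):
        self.alpha[a] += r
        self.beta[a] += 1 - r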

Indexed

  • Indexed UCB

Contextual

  • Thompson Sampling (Agrawal & Goyal 2012) [1]
  • LinUCB (Li, Chu et al 2010) [8]
  • GLM-UCB (Filippi et al. 2010) [in progress…] [5]
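A sketch of LinUCB with disjoint linear models per arm [8]: each arm keeps a ridge-regression estimate of its reward and adds an upper confidence bound. The dimensions and \(\alpha\) below are illustrative assumptions:

```python
import numpy as np

class LinUCB:
    """LinUCB (disjoint models): per-arm ridge regression with a UCB bonus."""
    def __init__(self, K, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(K)]    # d x d design matrices
        self.b = [np.zeros(d) for _ in range(K)]  # reward-weighted features
    def choose(self, contexts):
        # contexts: list of feature vectors x_{t,a}, one per arm
        scores = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]             # ridge estimate for arm a
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(ucb)
        return int(np.argmax(scores))
    def update(self, a, x, r):
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x
```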

Results (Context-Less)

  • KL-UCB and Thompson Sampling do the best, matching literature [3, 4]

Results (Contextual)

  • Indexed UCB and contextual Thompson Sampling do worse than expected [1]

Results (Contextual Continued)

  • LinUCB, as we'd expect, performs better with its access to side information

Comparison of Clickthrough Rates

  • LinUCB and GLM-UCB would likely beat even these results (they are not shown here, since all of these runs are at \(T = 1{,}000{,}000\))
    Policy               CTR
    KL-UCB               0.054508
    Thompson             0.054392
    UCB                  0.050476
    IndexedUCB           0.049134
    ContextualThompson   0.041023
    Random               0.036651
    Random               0.036580
    EpsilonGreedy(0.1)   0.035946

Examining Article Selection

Further Work and Investigation

  • Further evaluation of the contextual algorithms against what they should have pulled for each user cluster
  • Getting LinUCB and GLM-UCB to run in a reasonable amount of time
  • Why do contextual Thompson Sampling and indexed UCB perform so poorly?

Conclusions

  • Empirical results are mostly what we expect and largely match past simulation results [2, 4, 5]
  • "Rejection sampling" is a good technique for evaluating a variety of bandit algorithms
  • The high rejection rate limits the approach mostly to web applications, but it will hopefully spur more interest in collecting this type of data

References

[1] Agrawal, S. and Goyal, N. 2012. Thompson sampling for contextual bandits with linear payoffs. CoRR. abs/1209.3352, (2012).

[2] Auer, P., Cesa-Bianchi, N. and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning. 47, 2-3 (May 2002), 235–256.

[3] Cappé, O., Garivier, A., Maillard, O.-A., Munos, R. and Stoltz, G. 2013. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics. 41, 3 (Jun. 2013), 1516–1541.

[4] Chapelle, O. and Li, L. 2011. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems 24.

[5] Filippi, S., Cappé, O., Garivier, A. and Szepesvári, C. 2010. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems 23. J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, eds. 586–594.

[6] Lai, T.L. and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics. 6, 1 (1985), 4–22.

[7] Li, L., Chu, W. and Langford, J. 2010. An unbiased, data-driven, offline evaluation method of contextual bandit algorithms. CoRR. abs/1003.5956, (2010).

[8] Li, L., Chu, W., Langford, J. and Schapire, R.E. 2010. A contextual-bandit approach to personalized news article recommendation. CoRR. abs/1003.0146, (2010).

[9] Precup, D., Sutton, R.S. and Singh, S.P. 2000. Eligibility traces for off-policy policy evaluation. Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000) (San Francisco, CA, USA, 2000), 759–766.

[10] Thompson, W.R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika. 25, 3/4 (Dec. 1933), 285–294.

[11] Yahoo! Yahoo! Webscope dataset ydata-frontpage-todaymodule-clicks-v1.0.